AGREP(l) AGREP(l)
June 11, 1991
NAME
agrep - search a file for a string or regular expression, with
approximate matching capabilities
SYNOPSIS
agrep [ -#cdehilnpsvwxDIS ] pattern [ filename... ]
DESCRIPTION
agrep searches the input filenames (standard input is the default) for
records containing strings which either exactly or approximately match
a pattern. A record is by default a line, but it can be defined
differently using the -d option (see below). Normally, each record
found is copied to the standard output. Approximate matching allows
finding records that contain the pattern with several errors including
substitutions, insertions, and deletions. For example, Massechusets
matches Massachusetts with two errors (one substitution and one
insertion). Running agrep -2 Massechusets foo outputs all lines in
foo containing any string with distance at most 2 from Massechusets.
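The distance in the example above can be checked with a textbook dynamic-programming edit distance. The sketch below is only an illustration in Python; it is not the algorithm agrep uses (see the Wu and Manber report cited below), and the function name edit_distance is made up for this example.

```python
def edit_distance(a, b):
    """Unit-cost edit distance: each insertion, deletion, or
    substitution counts as one error."""
    # prev[j] holds the distance between the prefix of `a` processed
    # so far and the first j characters of `b`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                        # drop ca
                curr[j - 1] + 1,                    # add cb
                prev[j - 1] + (0 if ca == cb else 1),  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(edit_distance("Massechusets", "Massachusetts"))  # prints 2
```

The two errors are the substitution of e for a and the extra t.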
agrep supports many kinds of queries including arbitrary wild cards,
sets of patterns, and in general, arbitrary regular expressions. See
PATTERNS below. It supports most of the options supported by the grep
family plus several more (but it is not 100% compatible with grep).
For more information on the algorithm used by agrep see Wu and Manber,
"Fast Text Searching With Errors," Technical report #91-11, Department
of Computer Science, University of Arizona, June 1991 (available by
anonymous ftp from cs.arizona.edu inside agrep/agrep.tar as agrep.ps).
As with the rest of the grep family, the characters `$', `^', `*',
`[', `|', `(', `)', `!', `;', and `\' can cause unexpected results
when included in the pattern, as these characters are also meaningful
to the shell. To avoid these problems, always enclose the entire
pattern argument in single quotes, i.e., 'pattern'. Do not use double
quotes (").
agrep works only on text (ASCII) files. If a file is binary, agrep
generates an error message. Only one error message is generated even
if the file list contains many binary files.
When agrep is applied to more than one input file, the name of the
file is displayed preceding each line which matches the pattern. The
filename is not displayed when processing a single file, so if you
want the filename to appear, use /dev/null as a second file in the
list.
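When an agrep command line is assembled by a script rather than typed by hand, the quoting advice above can be applied mechanically. The following Python sketch uses the standard shlex.quote function, which wraps a string in single quotes whenever it contains characters the shell might interpret. The helper name agrep_command is hypothetical and exists only for this illustration.

```python
import shlex


def agrep_command(pattern, *filenames, errors=0):
    """Build a shell command line with the pattern safely quoted.

    Hypothetical helper for illustration; it only builds the string
    and does not run agrep.
    """
    parts = ["agrep"]
    if errors:
        parts.append(f"-{errors}")
    # shlex.quote leaves plain words alone and single-quotes the rest.
    parts.append(shlex.quote(pattern))
    parts.extend(shlex.quote(name) for name in filenames)
    return " ".join(parts)


print(agrep_command("abc[0-9](de|fg)*[x-z]", "foo", errors=1))
# prints: agrep -1 'abc[0-9](de|fg)*[x-z]' foo
```

Passing /dev/null as an extra filename, as in agrep_command("pattern", "foo", "/dev/null"), gives the form that makes the filename appear in the output.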
OPTIONS
-# # is a non-negative integer (at most 8) specifying the maximum
number of errors permitted in finding the approximate matches
(defaults to zero). Generally, each insertion, deletion, or
substitution counts as one error. It is possible to adjust the
relative cost of insertions, deletions and substitutions (see -I
-D and -S options).
-c Display only the count of matching lines.
-d 'delim'
Define delim to be the separator between two records. The
default value is '$', namely a record is by default a line.
delim can be a string of size at most 8 (with possible use of ^
and $), but not a regular expression. Text between two delim's
is considered as one record. For example, -d '$$' defines
paragraphs as records and -d '^From ' defines mail messages as
records. agrep matches each record separately. This option does
not currently work with regular expressions. delim cannot
currently contain special control characters.
-e pattern
Same as a simple pattern argument, but useful when the pattern
begins with a `-'.
-h Do not display filenames.
-i Case-insensitive search - e.g., "A" and "a" are considered
equivalent.
-l List only the files that contain a match.
-n Each line that is printed is prefixed by its line number in the
file.
-p Find lines in the text that contain a supersequence of the
pattern, that is, lines in which the characters of the pattern
appear in order, possibly with other characters in between. For
example, agrep -p DCS foo will match "Department of Computer
Science." This option has the same function as -I0, which sets
the cost of insertion to zero.
-s Work silently, that is, display nothing except error messages.
This is useful for checking the error status.
-v Inverse mode - display only those lines that do not contain the
pattern.
-w Search for the pattern as a word - i.e., surrounded by
non-alphanumeric characters. The non-alphanumeric characters must
surround the match; they cannot be counted as errors. For example,
agrep -w -1 car will match cars, but not characters.
-x The pattern must match the whole line.
-Ik Set the cost of an insertion to k (k is a non-negative integer).
This option does not currently work with regular expressions.
-Dk Set the cost of a deletion to k (k is a non-negative integer).
This option does not currently work with regular expressions.
-Sk Set the cost of a substitution to k (k is a non-negative
integer). This option does not currently work with regular
expressions.
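The effect of the -I, -D and -S costs, and the reason -p behaves like -I0, can be illustrated with a small weighted matching routine. This is a Python sketch under stated assumptions, not agrep's implementation: the function name best_match_cost is invented here, and the reading of "insertion" as an extra character in the text and "deletion" as a pattern character missing from the text is inferred from the Massechusets example in DESCRIPTION.

```python
def best_match_cost(pattern, text, ins=1, dele=1, sub=1):
    """Cheapest way to match `pattern` against any substring of `text`.

    ins:  cost of an extra character in the text   (cf. -I)
    dele: cost of a pattern character that is missing (cf. -D)
    sub:  cost of a mismatched character           (cf. -S)
    """
    # prev[j]: cost of matching the pattern prefix handled so far
    # against some substring of `text` that ends at position j.
    # The empty prefix matches anywhere for free, so a match may
    # start at any position in the text.
    prev = [0] * (len(text) + 1)
    for i, pc in enumerate(pattern, start=1):
        curr = [i * dele]  # nothing of the text used yet
        for j, tc in enumerate(text, start=1):
            curr.append(min(
                prev[j - 1] + (0 if pc == tc else sub),
                prev[j] + dele,
                curr[j - 1] + ins,
            ))
        prev = curr
    # The match may also end anywhere in the text.
    return min(prev)


# With free insertions the pattern only has to appear as a subsequence.
print(best_match_cost("DCS", "Department of Computer Science", ins=0))  # prints 0
```

A record would be reported when this cost does not exceed the number given with -#. For instance, with ins=0 the pattern abcdefghij against academia costs 5 in this sketch, since five of the ten letters are missing.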
PATTERNS
agrep supports a large variety of patterns, including simple strings,
strings with classes of characters, sets of strings, wild cards, and
arbitrary regular expressions.
Strings
Any sequence of characters, including the special symbols `^' for
beginning of line and `$' for end of line. The special
characters listed above (`$', `^', `*', `[', `|', `(', `)',
`!', and `\') should be preceded by `\' if they are to be
matched as regular characters. For example, \^abc\\ corresponds
to the string ^abc\, whereas ^abc corresponds to the string abc
at the beginning of a line.
Classes of characters
A list of characters inside [] (in order) corresponds to any
character from the list. For example, [a-ho-z] is any character
between a and h or between o and z. The symbol `^' inside []
complements the list. For example, when only lower-case letters
are considered, [^i-n] is the same as [a-ho-z]. The symbol `.'
stands for any symbol (don't care). The symbol `^' thus has two
meanings, but this is consistent with egrep.
Boolean operations
agrep supports an `and' operation `;' and an `or' operation `,',
but not a combination of both. For example, 'fast;network'
searches for all records containing both words.
Wild cards
The symbol '#' is used to denote a wild card. # matches zero or
any number of arbitrary characters. For example, ex#e matches
example. The symbol # is equivalent to .* in egrep. In fact, .*
will work too, because it is a valid regular expression (see
below), but unless this is part of an actual regular expression,
# will work faster.
Combination of exact and approximate matching
Any pattern inside angle brackets <> must match the text exactly,
even when errors are allowed in the rest of the pattern. For
example, <mathemat>ics matches mathematical with one error
(replacing the last s with an a), but mathe<matics> does not match
mathematical no matter how many errors we allow.
Regular expressions
The syntax of regular expressions in agrep is in general the same
as that for egrep. The union operation `|', Kleene closure `*',
and parentheses () are all supported. Currently '+' is not
supported. Regular expressions are currently limited to
approximately 30 characters (generally excluding meta
characters). Some options (-d, -w, -x, -D, -I, -S) do not
currently work with regular expressions. The maximal number of
errors for regular expressions that use '*' or '|' is 4.
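The meaning of the boolean operators and the wild card can be sketched for the exact-match case by translating them to Python's re module. This is a deliberately limited illustration: it handles only literal strings joined by `;', `,' and `#', ignores backslash escapes, character classes, angle brackets and error counts, and assumes single-line records. The function name record_matches is made up for this example.

```python
import re


def record_matches(pattern, record):
    """Exact-match sketch of `;' (and), `,' (or) and the `#' wild card."""
    if ";" in pattern and "," in pattern:
        # The manual states that the two cannot be combined.
        raise ValueError("`;' and `,' cannot be combined")

    def contains(part):
        # `#' matches zero or more arbitrary characters, like .* in egrep.
        regex = ".*".join(re.escape(piece) for piece in part.split("#"))
        return re.search(regex, record) is not None

    if ";" in pattern:
        return all(contains(part) for part in pattern.split(";"))
    return any(contains(part) for part in pattern.split(","))


print(record_matches("fast;network", "a fast local network"))  # prints True
print(record_matches("ex#e", "for example"))                    # prints True
```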
EXAMPLES
agrep -2 -c ABCDEFG foo
gives the number of lines in file foo that contain ABCDEFG within
two errors.
agrep -1 -D2 -S2 'ABCD#YZ' foo
outputs the lines containing ABCD followed, within arbitrary
distance, by YZ, with up to one additional insertion (-D2 and -S2
make deletions and substitutions too "expensive").
agrep -5 -p abcdefghij /usr/dict/words
outputs the list of all words containing at least 5 of the first
10 letters of the alphabet in order. (Try it: any list starting
with academia and ending with sacrilegious must mean something!)
agrep -1 'abc[0-9](de|fg)*[x-z]' foo
outputs the lines containing, within up to one error, the string
that starts with abc followed by one digit, followed by zero or
more repetitions of either de or fg, followed by either x, y, or
z.
agrep -d '^From ' 'breakdown; (inter|arpa|bit)net' mbox
outputs all mail messages (the pattern '^From ' separates mail
messages in a mail file) that contain breakdown and one of either
internet, arpanet, or bitnet.
agrep -d '$$' -1 '<word1> <word2>' foo
finds all paragraphs that contain word1 followed by word2 with
one error in place of the blank. In particular, if word1 is the
last word in a line and word2 is the first word in the next line,
then the space will be substituted by a newline symbol and it
will match. Thus, this is a way to overcome separation by a
newline. Note that -d '$$' (or another delim which spans more
than one line) is necessary, because otherwise agrep searches
only one line at a time.
agrep '^agrep' <this manual>
outputs all the examples of the use of agrep in this man page.
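The record definitions used in the -d examples above can be pictured with the following Python sketch. It only shows how the text is divided; whether agrep keeps the delimiter inside the record is an assumption made here, and the function names are invented for the illustration.

```python
def paragraphs(text):
    """Records as with -d '$$': pieces separated by an empty line."""
    return [piece for piece in text.split("\n\n") if piece]


def mail_messages(text):
    """Records as with -d '^From ': a new record starts at every
    line that begins with 'From '."""
    messages = []
    for line in text.splitlines(keepends=True):
        if line.startswith("From ") or not messages:
            messages.append("")
        messages[-1] += line
    return messages


text = "alpha word1\nword2 beta\n\ngamma\n"
# No single line holds both words, but the first paragraph does,
# with a newline where the pattern has a blank.
print(any("word1" in l and "word2" in l for l in text.splitlines()))  # prints False
print("word1\nword2" in paragraphs(text)[0])                          # prints True
```

This is why the '<word1> <word2>' example needs both a multi-line record and one allowed error: the newline takes the place of the blank.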
SEE ALSO
ed(1), ex(1), grep(1V), sh(1), csh(1).
BUGS
This is the first release of agrep. Expect some bugs, especially for
more complicated patterns. Any bug reports or comments will be
appreciated! Please mail them to sw@cs.arizona.edu or
udi@cs.arizona.edu.
There may be problems when control characters (e.g., <ctrl>A) are
used as part of a string or delimiter.
Regular expressions do not support the '+' operator (match 1 or more
instances of the preceding token). Such patterns can be searched for
by using this syntax in the pattern:
'pattern(pattern)*'
(search for strings containing one instance of the pattern, followed
by 0 or more instances of the pattern).
agrep sometimes adds an empty line to the output.
The following can cause an infinite loop: agrep pattern * >
output_file. If the number of matches is high, they may be deposited
in output_file before it is completely read, leading to more matches
of the pattern within output_file (the matches are against the whole
directory). It's not clear whether this is a "bug" (grep will do the
same), but be warned.
Patterns are currently limited to approximately 30 characters. Lines
are limited to 1024 characters. Records are limited to 8K, and may be
truncated if they are larger than that.
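The workaround for the missing '+' operator can be generated mechanically. The Python sketch below builds the rewritten form and checks it against Python's own re module, which is used here only as a stand-in for egrep-style syntax; the helper name one_or_more is hypothetical. The sub-pattern is always wrapped in parentheses so that an alternation such as a|b is repeated as a whole.

```python
import re


def one_or_more(sub_pattern):
    """Rewrite X+ as (X)(X)*, following the 'pattern(pattern)*' idea."""
    return f"({sub_pattern})({sub_pattern})*"


rewritten = "x" + one_or_more("ab") + "y"   # stands in for x(ab)+y
print(rewritten)                             # prints x(ab)(ab)*y

# Sanity check with Python's regular expressions (not with agrep).
print(bool(re.fullmatch(rewritten, "xababy")))  # prints True
print(bool(re.fullmatch(rewritten, "xy")))      # prints False
```

Since the rewrite roughly doubles the length of the repeated part, it can run into the pattern length limit mentioned above.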
DIAGNOSTICS
Exit status is 0 if any matches are found, 1 if none, 2 for syntax
errors or inaccessible files.
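A script can branch on these exit codes, typically together with the -s option. The following Python sketch assumes that an agrep binary is installed and on the PATH; the function names are invented for this illustration. Because the arguments are passed as a list and no shell is involved, the pattern needs no quoting here.

```python
import subprocess


def interpret_status(code):
    """Map the exit status documented above to a short description."""
    if code == 0:
        return "match"
    if code == 1:
        return "no match"
    if code == 2:
        return "error"
    return "unknown"  # not documented in this manual


def has_match(pattern, filename):
    """Run `agrep -s pattern filename` and report whether it matched.

    Assumes agrep is installed; raises RuntimeError on status 2.
    """
    result = subprocess.run(["agrep", "-s", pattern, filename])
    if result.returncode == 2:
        raise RuntimeError("syntax error or inaccessible file")
    return result.returncode == 0


print(interpret_status(1))  # prints no match
```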
- 5 - Formatted: August 24, 1994